Abstract. The extensible Markup Language (XML) provides a powerful and flexible means of encoding
and exchanging data. As it turns out, its main advantage as an encoding format (namely, its
requirement that all open and close markup tags are present and properly balanced) also yields one of its
main disadvantages: verbosity. XML-conscious compression techniques seek to overcome this drawback.
Many of these techniques first separate XML structure from the document content, and then compress 
each independently. Further compression gains can be realized by identifying and compressing together 
document content that is highly similar, thereby amortizing the storage costs of auxiliary information 
required by the chosen compression algorithm. Additionally, the proper choice of compression algorithm 
is an important factor not only for the achievable compression gain, but also for access performance. 
Hence, choosing a compression configuration that optimizes compression gain requires one to determine 
(1) a partitioning strategy for document content, and (2) the best available compression algorithm to 
apply to each set within this partition. In this paper, we show that finding an optimal compression 
configuration with respect to compression gain is an NP-hard optimization problem. This problem re- 
mains intractable even if one considers a single compression algorithm for all content. We also describe 
an approximation algorithm for selecting a partitioning strategy for document content based on the 
branch-and-bound paradigm. 



1 Introduction 

The extensible Markup Language (XML) has become increasingly popular as a data encoding format. XML 
has many benefits, but one notable weakness: its verbosity, resulting from the high markup-to-content ratio 
imposed in large part by requiring every markup tag to be properly closed. The increasing size of XML 
datasets has motivated researchers to seek ways to reduce storage costs by applying compression techniques. 
Because XML is inherently a textual format, the naive solution is to apply a generic text compression scheme. 
However, such schemes are not aware of XML syntax, and therefore cannot easily exploit redundancies in 
the tree structure unambiguously induced by the proper nesting of markup tags inside the XML document 
(such as repeated subtrees), or even distinguish an element tag from a text segment. Thus, such a strategy 
severely hinders query processing, which is fundamentally based on traversing the structure of the document. 

With such shortcomings in mind, many XML-conscious compression techniques have been proposed in 
recent years. Among them, homomorphic approaches to XML compression (e.g., [1-6]) preserve the original 
tree structure in the compressed representation by processing each node as it occurs during a pre-order 
traversal. Permutation-based approaches (e.g., [7-11]) re-arrange the document before performing
compression, in an attempt to group "similar" nodes together and therefore improve the achievable compression
rate. A commonly used permutation strategy treats structure separately from content, and then applies a 
partitioning strategy to group content nodes into a series of data containers. However, there is an inherent 
tradeoff between the achievable compression rate and access performance: in general, better compression 
tends to occur by grouping large sets of nodes together before compression, yet such a strategy will often 
hurt access time by increasing the number of decompression operations needed to extract relevant document 
fragments. 

In this paper, we focus on the permutation-based approaches, and study the complexity
of finding optimal strategies for container grouping and compression algorithm selection such that
the resulting compression configuration maximizes the overall compression gain, while keeping compression
and/or decompression time and compression model storage requirements within specified bounds. Arion
et al. [7] were the first to investigate (albeit informally) the tradeoff between compression rate and query 
performance, given a set of typical queries, a set of available compression algorithms, and a specific XML 
database as inputs. We consider a more general setting that captures the problem outlined in [7] as well 
as additional application domains, including data archiving and data exchange. We provide a complexity 
analysis showing that selecting an optimal compression configuration is NP-hard, and 
also describe an approximation algorithm based on a branch-and-bound technique that finds the optimal 
compression configuration within polynomial time (w.r.t. the document size and the number of available 
compression algorithms), with the choice of appropriate parameter values. 

The paper is structured as follows. Section 2 provides preliminary definitions and a background into the 
problem. Section 3 investigates the difficulty of choosing an optimal tradeoff between compression gain and 
query performance. Section 4 describes an approximation algorithm for choosing a near-optimal compression 
configuration, while Section 5 concludes the paper and outlines our future work. 

2 Preliminaries 
2.1 XML Data Model 

We recall that an XML document can be represented as a rooted, ordered, labeled tree (the document 
tree), in which the leaf nodes correspond to attribute values and text segments (document content), while 
the interior nodes represent attributes and elements (document structure). According to convention, we 
distinguish attribute names from elements by prepending the former with '@'. As an illustrative example 
application, we consider a social recommendation website, where users share their opinions of movies, music, 
etc. with other users. Additionally, users assign a prestige to other users, allowing them to express their 
evaluation of the quality of those users' recommendations. User account data is stored as XML; Fig. 1 shows 
a fragment of the document tree. 

Query languages for XML center around path expressions, which are used to specify subsets of nodes 
within the document tree. The two most influential XML query languages are XPath [12] and XQuery [13].

Example 1. For the example document tree of Fig. 1, the following XQuery returns the titles of movies rated 
at least 4.5 by users with a prestige ranking lower than 4. 

let $movies := for $user in doc("ratings.xml")//user
               where $user/prestige lt 4.0
               return $user/favorites/movies/movie

for $movie in $movies
where $movie/rating ge 4.5
return $movie/title

This query returns <title>Smoke</title>. 



2.2 XML Compression 

Permutation-based strategies for XML-conscious compression separately compress the document structure 
and text content. The textual content is organized into containers, usually based on the path (or just the 
name) of the parent element. The intuition for doing so is that values belonging to different instances of the 
same element are likely to exhibit similarities that facilitate compression. Fig. 2 shows the default path-based 
partitioning of the text content of the document tree in Fig. 1, in which data values belonging to each distinct 
element and attribute type are stored in a separate container.

Fig. 1. Example XML document tree.



Further compression gains can often be realized by generalizing the partitioning strategy to take into 
account additional factors, such as the data type of the content (e.g., integers, dates, and strings). Grouping 
together multiple containers with high pairwise similarity allows the containers to share the same compression 
source model, reducing storage costs while simultaneously allowing more complex models over the longer 
sequence to be built. Fig. 3 depicts a logical partitioning strategy that extends the default strategy from 
Fig. 2. Here, containers B, E, and H are grouped together, since user prestige, movie ratings, and song ratings 
are highly similar (i.e., they all consist of a real number value in the range [0.0, 5.0]). Similarly, since the 
titles of movies and songs and artist names are all free-form text, it may prove beneficial to group together 
containers C, F, and G. 

Yet compression gain is often not the only criterion guiding the selection of a container grouping strategy. 
The choice of a partitioning strategy can also impact the efficiency of random access to nodes within the 
document tree. In particular, query performance can be improved by choosing a partitioning strategy that 
places data segments involved in a common query within the same container subset. Doing so can dramatically 
reduce the number of required decompression operations. For Ex. 1, a beneficial partitioning strategy might 
instead group together containers B, C, and E. 

Proper algorithm selection is also an important factor to consider. Greater compression can be realized by 
choosing a compression algorithm that is well-suited for the type of data values stored in a container subset. 
Query performance is also impacted by the choice of compression algorithm, as the time required to carry 
out decompression adds to the query response time. Fig. 3 additionally assigns a compression algorithm to 
each container subset (in this case, either LZ77 or Huffman coding). 

Furthermore, certain compression algorithms allow classes of operations to be carried out without prior 
decompression; the choice of such an algorithm can therefore speed up query performance. For example,
using an order-preserving algorithm to compress user prestige and movie rating values would allow the
comparisons in both where clauses of the XQuery in Ex. 1 to be computed within the compressed domain,
without requiring the decompression of each such value beforehand.

Fig. 2. A path-based partitioning of data values from the document of Fig. 1.



2.3 XML Compression Measures 

We now consider the relevant measures used to evaluate solutions to XML compression problems, and discuss 
the inherent trade-offs between these values. 

Storage gain measures the relative amount of space saved by applying a compression algorithm a to a 
container C, denoted as gain(C, a). An effective measure must not only account for the size of the compressed
representation of C; it must also consider the additional space required to store auxiliary data structures
constructed by the compression source model (e.g., for the Huffman algorithm, this would indicate the size 
of the generated tree; in dictionary-based compression schemes, it would represent the size of the dictionary). 
It is calculated as

$$\mathit{gain}(C, a) = 1 - \frac{\text{compressed size of } C + \text{compression model size}}{\text{original size of } C} \qquad (1)$$



This measurement is also applicable to sets of containers; given a subset $S \subseteq \mathcal{C}$ and a compression
algorithm a, gain(S, a) is calculated by first concatenating the contents of each container in S, and then
using the compressed and original sizes of this concatenated container, together with the storage costs of the 
generated compression model, in the above formula. 
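
As a minimal sketch of Eq. 1, the following Python function computes the storage gain of a single container or of a container subset. It uses `zlib` purely as a stand-in compressor and takes the model size as an explicit parameter. The function name `storage_gain`, the `model_size` parameter, and the sample rating values are illustrative assumptions, not part of the method described here.

```python
import zlib


def storage_gain(containers, model_size=0, compress=zlib.compress):
    """Storage gain per Eq. 1 for one container or a container subset.

    containers: list of bytes objects, concatenated before compression.
    model_size: bytes needed for auxiliary model data (assumed 0 for zlib,
                whose model is embedded in the compressed stream).
    compress:   stand-in compressor; any bytes -> bytes callable works.
    """
    original = b"".join(containers)
    if not original:
        raise ValueError("cannot compute the gain of empty content")
    compressed_size = len(compress(original))
    return 1 - (compressed_size + model_size) / len(original)


# Hypothetical rating containers (values invented for illustration).
prestige = b"3.5 4.0 4.5 3.5 4.0 " * 50
ratings = b"4.5 3.5 4.0 4.5 4.0 " * 50
print(storage_gain([prestige]))           # single container
print(storage_gain([prestige, ratings]))  # container subset, concatenated
```

Note that the gain can be negative for very short inputs, where the fixed overhead of the compressed stream exceeds the original size.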

Compression cost and decompression cost measure, respectively, the time required to apply and reverse 
the compression process. Both time measures are largely dependent on the contents of the container(s) being 
compressed, as well as the compression algorithm being employed. Often, there is an inverse relationship 
between compression gain and compression/decompression time requirements, as the most effective compression
schemes tend to heavily transform the original sequence, while less sophisticated schemes that allow fast
compression and decompression typically fare worse in terms of compression performance. By comp(S, a)
and decomp(S, a), we denote, respectively, the time required to compress and decompress the contents of a
container subset S using algorithm a.

Fig. 3. A compression configuration for the document in Fig. 1.

In the sequel, we make the assumption that all three measures can be calculated in polynomial time 
(with respect to the size of the input container subset). Practical compression programs do feature a linear 
running time, meaning that in theory, an exact calculation of each measure can be obtained by simply 
running the compressor on the input. However, in many cases, the time and computational effort required 
to do this outweighs the benefit of obtaining a precise measurement (e.g., if the number of containers 
and/or compression algorithms is large), leading one to adopt faster and less expensive methods that provide 
reasonable estimates of compression gain, compression cost, and decompression cost. 

Trade-offs between compression gain and decompression cost. Across application domains, the primary mo- 
tivations for compressing XML differ. These goals typically dictate a tradeoff between compression gain and 
compression and/or decompression cost. We briefly examine the three most common use cases for XML 
compression. 

— Data archiving. This encompasses any application where large volumes of XML-encoded data must be 
preserved, yet accessed infrequently (e.g., web server logs). Here, the fundamental goal is to conserve 
disk space by maximizing the achievable compression gain, while much less emphasis is placed on time 
costs. This is because it is expected that each data set will only be compressed once, and only needs to 
be decompressed on the infrequent occasions that it is accessed. 

— Data exchange. In this class of applications, XML-encoded data is exchanged between multiple par- 
ties, typically over a network. XML documents are typically small in size and have a short life span 
(e.g., RPC-style web services, where an XML document only includes information related to a single 
service request or response. Once the document is processed, it is immediately discarded). Here, the use 
of XML compression seeks mainly to improve application throughput and reduce bandwidth consump- 
tion by reducing the average size of transmitted messages, while not imposing excessive additional costs 
for compression and decompression (with the rationale being that extra time spent compressing and 
decompressing messages must not outweigh the performance benefits gained by employing compression 
in the first place). 

— Database applications. In this scenario, the XML document is treated as a persistent data store over 
which queries are issued. In many instances, it is possible to anticipate the types of queries which will be 
most commonly issued, which can be represented as a query workload. XML compression techniques are
used primarily to improve query performance, by reducing the required number of disk reads/writes. As 
with any database application, query performance is paramount, and therefore minimizing decompression 
cost (particularly over the workload) becomes as important as maximizing the compression gain. Query- 
friendly compression schemes, allowing certain queries to be carried out directly on the compressed 
representation, are highly desirable as they can reduce the necessary number of decompression operations. 
Note that in our model, we can effectively capture the query performance for a given compression 
configuration over a particular workload within the decompression cost, as the latter measure is a function 
both of the chosen container partition strategy and the chosen compression algorithms for each container 
subset within the partition. 

3 Complexity Analysis of Compression Configuration Selection 

We recall from the discussion in Sec. 2.2 that our goal is to discover an optimal compression configuration, 
specifying both a partitioning strategy of the container set C and an assignment of a compression algorithm 
to each partition set. In this section, we demonstrate the NP-hardness of this problem. 

Definition 1. A configuration $(P, \alpha)$ consists of a partition $P = \{S_1, \ldots, S_t\}$ of $\mathcal{C}$, and an algorithm
assignment function $\alpha : P \to A$ that assigns to each $S \in P$ a compression algorithm $a \in A$. □

Definition 2. An instance of the optimization version of the optimal compression configuration problem
consists of the following inputs:

— a set of available compression algorithms $A = \{a_1, \ldots, a_q\}$;

— a set of containers $\mathcal{C} = \{C_1, \ldots, C_n\}$;

— $\mathit{gain} : 2^{\mathcal{C}} \times A \to \mathbb{Q}$, a function indicating the compression gain obtained when a specific compression
algorithm in $A$ is applied to a specific container subset in $2^{\mathcal{C}}$;

— $\mathit{comp} : 2^{\mathcal{C}} \times A \to \mathbb{Q}$, a function indicating the time cost associated with a compression of a specific
container subset in $2^{\mathcal{C}}$ using a specific algorithm in $A$;

— $\mathit{decomp} : 2^{\mathcal{C}} \times A \to \mathbb{Q}$, a function indicating the time cost associated with decompressing a specific
container subset in $2^{\mathcal{C}}$ that has previously been compressed using a specific algorithm in $A$;

— $T_c$, an upper bound on total compression cost; and

— $T_d$, an upper bound on total decompression cost.

The goal is to discover a configuration $(P, \alpha)$ that maximizes

$$\sum_{S \in P} \mathit{gain}(S, \alpha(S))$$

subject to

$$\sum_{S \in P} \mathit{comp}(S, \alpha(S)) \le T_c$$

and

$$\sum_{S \in P} \mathit{decomp}(S, \alpha(S)) \le T_d .$$

In the decision version of the problem, there is an additional input $L \in \mathbb{Q}^+$ and a solver outputs "yes" if
there exists a configuration $(P, \alpha)$ such that

$$\sum_{S \in P} \mathit{gain}(S, \alpha(S)) \ge L$$

subject to the given constraints, and "no" otherwise.
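
To make the search space of Definition 2 concrete, the following Python sketch enumerates every partition of a small container set together with every algorithm assignment, and keeps the best feasible configuration. It is an exhaustive baseline for illustration only: the number of partitions grows as the Bell numbers, so it is usable only for a handful of containers. The helper names and the toy gain and cost functions are hypothetical.

```python
from itertools import product


def set_partitions(items):
    """Yield every partition of `items` as a list of frozensets."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for partial in set_partitions(rest):
        # Place `first` in a new block of its own ...
        yield [frozenset([first])] + partial
        # ... or add it to each existing block in turn.
        for i, block in enumerate(partial):
            yield partial[:i] + [block | {first}] + partial[i + 1:]


def best_configuration(containers, algorithms, gain, comp, decomp, t_c, t_d):
    """Exhaustively search all (partition, assignment) pairs."""
    best, best_gain = None, float("-inf")
    for partition in set_partitions(list(containers)):
        for assignment in product(algorithms, repeat=len(partition)):
            pairs = list(zip(partition, assignment))
            if sum(comp(s, a) for s, a in pairs) > t_c:
                continue  # violates the compression cost bound
            if sum(decomp(s, a) for s, a in pairs) > t_d:
                continue  # violates the decompression cost bound
            total = sum(gain(s, a) for s, a in pairs)
            if total > best_gain:
                best, best_gain = pairs, total
    return best, best_gain


# Toy instance (all numbers invented): larger subsets compress better,
# and the "strong" algorithm doubles the gain at triple the time cost.
factor = {"fast": 1, "strong": 2}
cost = {"fast": 1, "strong": 3}
toy_gain = lambda s, a: factor[a] * len(s) ** 2
toy_cost = lambda s, a: cost[a] * len(s)

config, total = best_configuration(
    ["B", "C", "E"], ["fast", "strong"], toy_gain, toy_cost, toy_cost, 5, 5
)
print(config, total)
```

In this toy instance the cost bounds rule out applying the "strong" algorithm to the full set, so the best feasible configuration groups all three containers under "fast".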
Theorem 1. Selecting an optimal compression configuration is NP-hard.

Proof. To show that the decision version of optimal compression configuration is in NP, note that it is
possible to efficiently verify that a given configuration $(P, \alpha)$ yields a "yes" answer by calculating the
respective sums for the gain, comp, and decomp values of each $(S, \alpha(S))$ pair and ensuring they obey the
specified constraints for $L$, $T_c$, and $T_d$, respectively.

NP-completeness follows from a polynomial-time reduction from SUBSET SUM, whose decision version is
known to be NP-complete [14]. Recall that an instance of the decision version of SUBSET SUM consists of
a set $Y = \{y_1, \ldots, y_r\}$ of elements, a function $s : Y \to \mathbb{Z}^+$ that assigns a score to each $Y$-element, and a
value $B \in \mathbb{Z}^+$. A solver outputs "yes" if there exists a subset $Y' \subseteq Y$ such that $\sum_{y \in Y'} s(y) = B$, and "no"
otherwise.

The reduction proceeds as follows. From $Y = \{y_1, \ldots, y_r\}$, the container set $\mathcal{C} = \{C_1, \ldots, C_r\}$ is
constructed. $A$ contains a single compression algorithm $a$; for a specific container subset $S \subseteq \mathcal{C}$, we set
$\mathit{gain}(S, a) = \mathit{comp}(S, a) = \mathit{decomp}(S, a) = \sum_{C_i \in S} s(y_i)$. Furthermore, $L = T_c = T_d = B$. The only
compression configurations yielding a "yes" answer correspond under this mapping to an instance of SUBSET SUM
which would yield a "yes" answer from the SUBSET SUM solver.

Since the decision version of the optimal compression configuration problem is NP-complete, it follows 
that its optimization version is NP-hard. □ 

As a consequence of the preceding proof, we may also infer the following about the variant of optimal
compression configuration selection where the same compression algorithm is applied to all container groups.

Corollary 1. Selection of an optimal compression configuration remains NP-hard when $|A| = 1$.

This indicates that the "hardness" of the overall problem is not caused by algorithm selection; rather, it
is due to the difficulty of determining an optimal container partitioning strategy. Indeed, we can determine
an optimal algorithm selection for a specified container partitioning strategy in $O(|A| \cdot |\mathcal{C}|)$ time by simply
testing each available algorithm on each container subset.

4 An Approximation Algorithm for Compression Configuration Selection 

In this section, we describe an approximation algorithm for selecting an optimal compression configuration. 
Throughout the discussion, we use the term container subset to refer to one or more containers which 
have been grouped together, and grouping to indicate a set of container subsets. A grouping which covers 
all containers (i.e., assigns each container to exactly one container subset) is referred to as a partitioning 
strategy. 

In the first phase of the approximation algorithm (Sec. 4.4), a branch-and-bound strategy is used to select 
a set of candidate partitioning strategies: a set of partitioning strategies which are estimated to be highly 
compressible. In the second phase (Sec. 4.5), this set of partitioning strategies is tested against the set of 
available compression algorithms to determine the single compression configuration that yields the highest 
compression gain, while obeying the specified upper bounds on compression and decompression costs. 

We recall from Eq. 1 that computing the compression gain of a container subset S is based on two addi- 
tional measures: the size of the compressed representation of S, and the additional storage cost incurred by 
the generated compression model. In the remainder of this section, we first describe how container compress- 
ibility and storage costs are estimated, and then discuss how these estimates are used in the computation of 
compression gains. We then detail both phases of the approximation algorithm. 

4.1 Estimating Compressibility 

In his classic paper [15], Shannon proved that the entropy rate $r$ of a stationary stochastic process represents
a bound on lossless compression of any message emitted by that process. For a set of random variables
$X_1, \ldots, X_n$ whose values are drawn from a finite alphabet $\mathcal{X}$,

$$r = \lim_{n \to \infty} H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$$

where $H(X_n \mid X_{n-1}, X_{n-2}, \ldots, X_1)$ denotes the conditional entropy of $X_n$ when the values of
$X_{n-1}, X_{n-2}, \ldots, X_1$ have been witnessed. Essentially, each $X_i$ represents a separate message from the same
source; as one receives more and more of these messages (i.e., as $n$ approaches infinity), one has more of a
history on which to base the entropy estimate, and hence the estimate will approach the true entropy value
more closely.

While Shannon's entropy rate does provide a theoretical lower bound on compressibility, it proves to be 
impractical for our setting. This is because the entropy rate is an asymptotic measure calculated by increasing 
the compression block size n to infinity, while this is clearly impossible to do when one's knowledge of the 
source consists of a single string of finite length. In other words, a single finite length string often does not 
provide enough opportunity to "learn" the source well enough to achieve an accurate measure of the true 
entropy. 

We instead turn to Lempel and Ziv's method for calculating string complexity [16]. In this approach,
which we refer to as LZ76, the input string $x$ is parsed once from left to right, and a set of phrases $P_x$ is
recursively built and added to a dictionary. Once parsing has been completed, the complexity of $x$ is

$$C_{LZ}(x) = \frac{|P_x|}{|x|} , \qquad (2)$$

the ratio of phrases per character. Lempel and Ziv showed that this approach yields an approximation ratio 
of to Shannon's entropy rate. 

We now describe the parsing process of LZ76 in greater detail. 

1. Initialize the dictionary to be empty. 

2. If the end of x has been reached, terminate. Otherwise, read the next character from x and assign it 
to phrase p. If p matches an existing entry in the dictionary, continue reading characters from x and 
appending them to p until p no longer matches an existing dictionary entry. 

3. Assign p the next available index position, and add both the index value and p to the dictionary. Go to 
step 2. 

Example 2. For a container subset $S$ with contents "aaabc", the first iteration constructs the phrase (a) (step
2) and adds it to the dictionary (step 3) at index position 1. The second iteration reads the next character
('a') from $x$ and assigns it to $p$. Since (a) is in the dictionary, the next character is read from $x$ and appended
to $p$ to form the phrase (aa), which does not appear in the dictionary. This phrase is added to the dictionary
at index position 2. The following iterations construct phrases (b) and (c) and add them to the dictionary
at index positions 3 and 4, respectively. We then compute the complexity of $S$ as $C_{LZ}(S) = 4/5 = 0.8$.
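
The parsing steps above can be sketched in Python as follows. The function names are illustrative. The description does not say what happens when the input ends while the current phrase still matches a dictionary entry, so the sketch assumes that such a trailing phrase is still counted.

```python
def lz76_phrases(x):
    """Parse x left to right into phrases, following steps 1-3.

    Assumption: if the input ends while the current phrase still matches
    a dictionary entry, that final (repeated) phrase is still counted.
    """
    dictionary = {}          # phrase -> index position (starting at 1)
    phrases = []             # phrases in the order they were parsed
    p = ""
    for ch in x:
        p += ch
        if p not in dictionary:
            dictionary[p] = len(dictionary) + 1
            phrases.append(p)
            p = ""
    if p:                    # leftover phrase that matched an entry
        phrases.append(p)
    return phrases


def lz_complexity(x):
    """Phrases per character, as in Eq. 2."""
    return len(lz76_phrases(x)) / len(x)


print(lz76_phrases("aaabc"))   # ['a', 'aa', 'b', 'c']
print(lz_complexity("aaabc"))  # 0.8
```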



4.2 Estimating Storage Cost 

Obtaining a complete picture of the achieved storage gain via compression requires one to take into account 
not only the size of the compressed data itself, but also the additional space required to store information 
about the compression model used. Compression models typically consist of a mapping between the un- 
compressed symbol alphabet and the corresponding codewords assigned to each symbol by the compression 
algorithm. 

To compute the storage gain for a container subset, we simulate the cost of transmitting the dictionary
using the coding strategy of LZ78 [17] (recalling that LZ78 utilizes the parsing strategy of LZ76 in concert
with a specific coding strategy for dictionary phrases). Each time a new phrase of length $l$ is constructed,
two pieces of information are emitted to the compression stream: (1) a codeword $W$, representing the index
position of the existing phrase $p$ of length $l - 1$ that forms a prefix of the new phrase, and (2) the "innovative"
character $c$ that is appended to $p$ to form the new phrase. Since phrase indexing begins at 1, the highest
index value for a dictionary with $t$ phrases will be $t$. Using a fixed-length encoding, then, we can express
each $W$ value using $\log_2(t)$ bits, requiring a total of $t \cdot \log_2(t)$ bits to encode all codewords. Furthermore, a
single character $c$ is emitted each time a new phrase is created, requiring an extra $8 \cdot t$ bits (here, we assume
a text encoding that requires a single byte per character is in use; multibyte formats can be incorporated by
replacing 8 with the number of bits per character used in the chosen encoding format).

Definition 3. The storage cost (expressed in bits) associated with a container subset $S$ is calculated as

$$\mathit{storageCost}(S) = t \cdot (8 + \log_2(t)) \qquad (3)$$

where $t$ is the total number of entries in the dictionary after an LZ76 parsing of $S$.

The storage cost (expressed in bits) associated with a container grouping $G$ is calculated as

$$\mathit{storageCost}(G) = \sum_{S \in G} \mathit{storageCost}(S) , \qquad (4)$$

namely, it is the sum of the storage costs of each container subset $S$ contained within $G$.
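
A small Python sketch of Eqs. 3 and 4 follows. The function names and the `bits_per_char` parameter are illustrative, and returning a cost of 0 for an empty dictionary is an assumption made to avoid taking the logarithm of zero.

```python
import math


def storage_cost(t, bits_per_char=8):
    """Bits needed to store a dictionary of t phrases (Eq. 3)."""
    if t == 0:
        return 0.0  # assumption: an empty dictionary costs nothing
    return t * (bits_per_char + math.log2(t))


def grouping_storage_cost(phrase_counts):
    """Eq. 4: sum of the per-subset storage costs of a grouping."""
    return sum(storage_cost(t) for t in phrase_counts)


print(storage_cost(4))                 # 40.0
print(grouping_storage_cost([4, 5]))   # cost of a grouping with two subsets
```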



4.3 Modeling Compression Gain 

Two distinct gain measures are associated with each container grouping: the local compression gain (localGain)
indicates the compression gain obtained by using the current grouping, while the maximum potential
compression gain (mpGain) indicates the highest possible compression gain that can be obtained moving
forward by choosing any partitioning strategy that "agrees with" the current grouping (i.e., there exists no
container C such that the current grouping and the partitioning strategy place C within different container
subsets). Both measures are used in the first phase of the algorithm to guide the search for candidate container
partitioning strategies, and we now describe how both measures are calculated.

Definition 4. The local compression gain (expressed in bits) of a container subset $S$, denoted $\mathit{localGain}(S)$,
is calculated as

$$\mathit{localGain}(S) = \max\{0, r(S)\} , \qquad (5)$$

where

$$r(S) = 8 \cdot |S| - (C_{LZ}(S) \cdot |S| + \mathit{storageCost}(S)) \qquad (6)$$

and $|S|$ indicates the total byte length of the contents of $S$.

Eq. 5 ensures that compression is only applied if it results in a positive compression gain; otherwise, the
subset $S$ is left uncompressed, and $\mathit{localGain}(S) = 0$. In Eq. 6, the sum of the estimated compressed size of
$S$ and the associated storage cost is subtracted from the original bit length of $S$. This quantity represents
the total number of bits saved by applying compression to $S$. Note that while Eq. 6 assumes a byte-level
compression of container contents, text encoding schemes using multiple bytes per character (e.g., Unicode
formats) may be supported by considering each byte as an individual token.

Definition 5. The local compression gain of a container grouping $G$ is calculated as

$$\mathit{localGain}(G) = \sum_{S \in G} \mathit{localGain}(S) . \qquad (7)$$

Hence, the overall local compression gain for an entire container grouping (i.e., a set of container subsets)
is simply calculated as the sum of local gains for each container subset within that grouping.
Input: $D$, the set of existing LZ76 dictionaries for the grouping $G$; $c_t$, total number of characters in all
containers of $\mathcal{C}$; $c_u$, number of remaining unprocessed characters.
Output: $\mathit{mpGain}(G)$, indicating the maximum potential compression gain for $G$.

1. Choose the dictionary $d \in D$ containing the phrase of longest length, and let $S_{max}$ denote the container subset
whose dictionary is $d$. In case of a tie, choose the subset with the lowest $C_{LZ}$ value. Set nPhrases to be the
number of phrase entries in $d$, and maxPhraseLength to be the length of the longest phrase, plus one.

2. While $c_u \ge$ maxPhraseLength, simulate the creation of a new, longer phrase by performing the following steps:

(a) Set $c_u = c_u -$ maxPhraseLength. This reduces the number of unprocessed characters to include only those
not covered by the new phrase.

(b) Set nPhrases = nPhrases + 1, to update the count of phrases in the dictionary $d$.

(c) Set maxPhraseLength = maxPhraseLength + 1, to ensure the next created phrase (if applicable) will
have a length one character longer than the current longest phrase.

3. At this point, if any unprocessed characters remain (i.e., $c_u > 0$), this number is less than the length of the next
new phrase to be created. To handle the remaining characters, we just choose an existing phrase of length $c_u$
from $d$ and no additional phrases will be added from this point.

4. Compute $C_{LZ}(S_{max}) = \mathit{nPhrases} / c_t$, and use this value to recalculate $\mathit{localGain}(S_{max})$.

5. Compute and return $\mathit{mpGain}(G) = \sum_{S \in G \setminus \{S_{max}\}} \mathit{localGain}(S) + \mathit{localGain}(S_{max})$.

Algorithm 1: Calculation of maximum potential compression gain.



Example 3. Recalling the example grouping $S = \{aaabc\}$ from Ex. 2, $r(S) = 5 \cdot 8 - (0.8 \cdot 5 + (4 \cdot (8 + \log_2(4)))) = -4$
bits and therefore $\mathit{localGain}(S) = 0$ bits, indicating that $S$ should be left uncompressed.
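
The arithmetic of Example 3 can be checked with a short Python sketch of Eqs. 5 and 6. The function names are illustrative; the sketch assumes one byte per character and counts a trailing phrase that repeats a dictionary entry, a case the parsing description leaves open.

```python
import math


def phrase_count(x):
    """Number of phrases produced by the LZ76-style parsing of Sec. 4.1."""
    seen, p, count = set(), "", 0
    for ch in x:
        p += ch
        if p not in seen:
            seen.add(p)
            count += 1
            p = ""
    return count + (1 if p else 0)  # assumption: count a trailing repeat


def r(x):
    """Eq. 6: bits saved (possibly negative) by compressing non-empty x."""
    n = len(x)                      # byte length, assuming 1 byte per char
    t = phrase_count(x)
    c_lz = t / n
    storage = t * (8 + math.log2(t))
    return 8 * n - (c_lz * n + storage)


def local_gain(x):
    """Eq. 5: a subset with negative r is left uncompressed."""
    return max(0, r(x))


print(r("aaabc"))           # -4.0
print(local_gain("aaabc"))  # 0
```

Longer, more repetitive contents amortize the dictionary cost and yield a positive gain; for instance, `local_gain("a" * 1000)` is positive under the same assumptions.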

As mentioned above, the maximum potential compression gain is used to indicate the upper bound on 
the achievable compression gain for any partitioning strategy that agrees with the current grouping. Since 
the total number of characters (i.e., the number of characters contained within the existing grouping G, plus 
the number of characters contained within containers that have yet to be assigned to subsets) is fixed, so 
too is the first product in Eq. 6, and maximizing compression gain over a subset S then requires the sum 
of $C_{LZ}(S)$ and $\mathit{storageCost}(S)$ to be minimized. From Eq. 2 and Eq. 3, one observes that both quantities
are minimized when the number of generated phrases is also minimized. Equivalently, at each step during 
LZ76 parsing, one seeks to generate the longest applicable phrase by appending an extra character to the 
longest existing phrase in the dictionary. Alg. 1 illustrates how the maximum potential gain is calculated for 
a grouping. 

In the first step, the longest phrase over all subset dictionaries is identified. For the container subset 
Smax whose dictionary contains this longest phrase, the existing dictionary is extended with longer phrases, 
until no unprocessed characters remain. More precisely, each iteration of step 2 creates a new phrase one 
character longer than the previous longest phrase, and applies it to the sequence of unprocessed characters. 
Eventually, either all remaining characters will be processed, or the number of remaining characters will be 
less than the longest phrase. In the latter case, a shorter existing phrase is reused to cover the remaining 
characters (step 3). Step 4 computes the new value of C_LZ(S_max), and updates the value of localGain(S_max).
Finally, step 5 computes the mpGain for the grouping G by summing the updated localGain score for Smax 
with the existing localGain scores for the remaining subsets in G. 

Example 4. To illustrate the computation of mpGain, we recall from Ex. 2 the previous example subset
S = {aaabc}, and the dictionary of phrases {(a), (aa), (b), (c)} that results from an LZ76 parsing of S.
Assume that there is one additional container C_x with 5 characters. Alg. 1 first selects the longest existing
phrase (aa) and constructs a new phrase of length 3. Applying this to C_x leaves only 5 − 3 = 2 remaining
unprocessed characters, a number which is less than 3, the current maximum phrase length. Therefore, the
existing pattern (aa) is applied, and no unprocessed characters remain. Only one additional pattern has been
created, and the new complexity score is 5/10 = 0.5 symbols per character, while the updated storage cost
is 5 · (8 + log2(5)) ≈ 51.6096 bits, and mpGain(S) ≈ 10 · 8 − (0.5 · 10 + 51.6096) ≈ 23.3904 bits.
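
The arithmetic in Ex. 3 and Ex. 4 can be reproduced with a short script. The sketch below is an illustration only: the gain formula is inferred from the numbers in the two examples (8 bits per raw character, a storage cost of p · (8 + log2(p)) bits for p phrases) rather than copied from Eq. 2, 3, or 6, and the function names are hypothetical. It also assumes a single subset, and that an existing phrase short enough to cover the leftover characters in step 3 is always available.

```python
import math

def raw_gain(num_chars, num_phrases):
    """Gain in bits, using the formula implied by Ex. 3 and Ex. 4 (an assumption)."""
    c_lz = num_phrases / num_chars                       # phrases per character
    storage = num_phrases * (8 + math.log2(num_phrases))
    return num_chars * 8 - (c_lz * num_chars + storage)

def local_gain(num_chars, num_phrases):
    """A negative gain means the subset is left uncompressed, i.e. a gain of 0."""
    return max(0.0, raw_gain(num_chars, num_phrases))

def mp_gain(num_chars, num_phrases, longest, unprocessed):
    """Optimistic gain for one subset after absorbing `unprocessed` characters."""
    while unprocessed >= longest + 1:    # step 2: add a phrase one character longer
        longest += 1
        num_phrases += 1
        num_chars += longest
        unprocessed -= longest
    num_chars += unprocessed             # step 3: leftover covered by an existing phrase
    return local_gain(num_chars, num_phrases)

print(raw_gain(5, 4))                                      # Ex. 3, before clipping
print(local_gain(5, 4))                                    # Ex. 3, localGain
print(round(mp_gain(5, 4, longest=2, unprocessed=5), 4))   # Ex. 4, mpGain
```

Under these assumptions the three printed values correspond to the −4 bits and 0 bits of Ex. 3 and the roughly 23.3904 bits of Ex. 4.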
Input: container C_i ∈ C, context node x, and a threshold value δ ∈ R^+
Output: a set Q of candidate container partitioning strategies

1. Construct as the leftmost child of x the grouping formed by adding the single-container subset {C_i} to G_x.

2. For each existing subset S ∈ G_x, add a child to x corresponding to the grouping formed by (G_x \ {S}) ∪ {S ∪ {C_i}}.

3. For each of the child nodes y created in steps 1 and 2, let G_y represent the grouping associated with y and
   calculate localGain(G_y) and mpGain(G_y).

4. If one of the newly constructed child nodes y results in a localGain(G_y) value that is higher than optGain,
   set optGain to this value.

5. "Kill" any child nodes y for which mpGain(G_y) < optGain − δ.

Algorithm 2: Construction of the branch-and-bound search tree. 



4.4 Branch-and-Bound Algorithm for Selecting Candidate Partitioning Strategies 

In this phase, a search tree is constructed in which each node corresponds to a particular grouping. Each 
node stores the localGain and mpGain values for its associated grouping. The subtree rooted by a node n 
encompasses all groupings that extend the grouping associated with n by assigning additional containers to 
container subsets. 

Before explaining the details of the branch-and-bound procedure, we begin with an intuition as to why
this technique is applicable to the subproblem of choosing a container grouping. Recall that the mpGain
indicates the highest gain possible for any partitioning strategy based on the current grouping. In
addition, we may also observe that mpGain(p) ≥ mpGain(c) for any parent node p and child node c in the
search tree. This is due to the fact that there are fewer remaining unprocessed characters as one travels from
p to c: in particular, the placement of one additional container has been "fixed" by the grouping associated
with c. At the lowest level of the search tree, all containers have a fixed placement (i.e., each leaf node
corresponds to a partitioning strategy), and therefore mpGain will equal localGain for each leaf node.

Exploiting these properties of the mpGain measure provides us with our bounding criterion: if the mpGain 
for a grouping is sufficiently less than the best local gain value encountered thus far, the entire subtree rooted 
at the node representing the grouping can be immediately eliminated from consideration (or "killed"). We 
now are in a position to describe the specifics of the branch-and-bound procedure. 

The inputs to the procedure are a set of containers C, sorted in descending order of their respective
sizes, along with an additional parameter δ ∈ R^+. The latter specifies a threshold value used to determine
whether a particular node should be "killed", or if it is worthwhile to continue branching into its subtree
(in which case it is considered to be a "live" node). During the search procedure, the optimal local gain
value encountered so far is stored in variable optGain. The root node of the search tree is assigned the
grouping {C_1}, that is, a single set containing only the first container. For i = 2, ..., |C|, the steps in Alg. 2
are carried out to enumerate the various choices for placement of each container C_i within the context of an
existing grouping (where each such choice corresponds to a child node of the existing grouping node), and
to determine the optimal choice of placement among the alternatives. Note that in Alg. 2, x refers to the
node currently being evaluated in the tree, and G_x refers to the container grouping associated with x.

For each "live" node p at level i in the tree, a set of child nodes is constructed; each represents a different
strategy for placing the container C_{i+1} into either a new subset, or within one of the existing container subsets
present in the grouping associated with p. Once all "live" nodes at level i have been branched, mpGain and
localGain values for all nodes at level i are computed, and if necessary optGain is updated to reflect a new
global maximum for localGain. For each node having a localGain less than optGain, a test is carried out to
ensure that its mpGain falls within the range [optGain − δ, optGain]. If the test fails, the node is "killed".
Further branching is only carried out at level i + 1 on the remaining live nodes at level i; at each iteration,
the unbranched node at level i with the highest mpGain value is chosen. At level |C|, the remaining live
nodes will comprise the set Q of candidate partitioning strategies.
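
The level-by-level procedure can be sketched in a few lines of Python. This is a simplified illustration, not the paper's implementation: it scores and prunes one whole level at a time rather than choosing the unbranched node with the highest mpGain first, and the gain functions are passed in as callables. The toy gain functions at the bottom are hypothetical stand-ins (they reward subsets with a small alphabet) and are unrelated to the LZ76-based measures; they are chosen only so that the toy mpGain is a valid upper bound.

```python
def branch_and_bound(containers, local_gain, mp_gain, delta):
    """Return the live groupings at the last level (the candidate set Q)."""
    ordered = sorted(containers, key=len, reverse=True)    # largest containers first
    live = [[[ordered[0]]]]                                 # root grouping {C_1}
    opt_gain = local_gain(live[0])
    for i in range(1, len(ordered)):
        c, rest = ordered[i], ordered[i + 1:]
        children = []
        for g in live:
            children.append(g + [[c]])                      # step 1: new subset {C_i}
            for k in range(len(g)):                         # step 2: join subset k
                children.append(g[:k] + [g[k] + [c]] + g[k + 1:])
        # step 3: score every child
        scored = [(ch, local_gain(ch), mp_gain(ch, rest)) for ch in children]
        # step 4: update the best local gain seen so far
        opt_gain = max([opt_gain] + [lg for _, lg, _ in scored])
        # step 5: kill children whose optimistic gain is too low
        live = [ch for ch, _, mp in scored if mp >= opt_gain - delta]
    return live

# Toy gain functions (assumptions for illustration, NOT the paper's measures).
def toy_subset_gain(subset):
    text = "".join(subset)
    return max(0, len(text) - 3 * len(set(text)))

def toy_local_gain(grouping):
    return sum(toy_subset_gain(s) for s in grouping)

def toy_mp_gain(grouping, rest):
    # Each unassigned character can raise the toy gain by at most 1.
    return toy_local_gain(grouping) + sum(len(c) for c in rest)

containers = ["aaaaaaaa", "aaaaaa", "bcd"]
print(branch_and_bound(containers, toy_local_gain, toy_mp_gain, delta=0))
print(branch_and_bound(containers, toy_local_gain, toy_mp_gain, delta=3))
```

With the toy measures, a threshold of 0 leaves a single candidate, while a threshold of 3 also keeps the grouping that places every container in its own subset.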

We illustrate the working of the branch-and-bound procedure with the following example. 
{C_1}: localGain = 50.4707, mpGain = 300.6970
    {C_1},{C_2}: localGain = 50.4707, mpGain = 168.9804
        {C_1},{C_2},{C_3}: localGain = 100.9413, mpGain = 100.9413
        {C_1,C_3},{C_2}: localGain = 126.3966, mpGain = 126.3966
        {C_1},{C_2,C_3}: localGain = 50.4707, mpGain = 50.4707
    {C_1,C_2}: localGain = 0, mpGain = 108.6180
        {C_1,C_2},{C_3}: localGain = 50.4707, mpGain = 50.4707
        {C_1,C_2,C_3}: localGain = 62.7933, mpGain = 62.7933

Fig. 4. Branch-and-bound search tree for Ex. 5.



Example 5. Assume that we have the container set C = {C_1, C_2, C_3}, where the respective contents of the
containers are C_1 = {aaabcaaabcaaabcabcab}, C_2 = {15720653197608243849}, and C_3 = {abcababcbaaaabcabcab}.
We set δ = 30.0 bits. Fig. 4 depicts the search tree formed by this process. Initially, the grouping {C_1} is
formed as the root of the search tree.

In the second level of the tree, both possibilities for incorporating container C_2 into the existing grouping
are considered: either creating a second subset to store C_2, or combining C_2 with C_1 in a single container
subset. The left child of the root corresponds to the first choice, creating the grouping {C_1},{C_2}, while
the right child represents the second choice as the grouping {C_1,C_2}. The gain values are then calculated
for both children, and the test in Step 5 is performed, which determines that neither node can be "killed".
Therefore, child nodes are constructed for both nodes based on the possibilities for assigning container C_3.

In the case of the left child, there are three possibilities: assigning C_3 to a third, separate subset to form the
grouping {C_1}, {C_2}, {C_3}; appending C_3 to the first existing subset to create the grouping {C_1, C_3}, {C_2};
or adding C_3 to the second existing subset, creating the grouping {C_1}, {C_2, C_3}. Hence, three children are
created within the left subtree. A similar process creates two child nodes in the right subtree, corresponding
to the choice of creating a new subset for C_3 (generating the grouping {C_1, C_2}, {C_3}), or combining it
within the existing subset of the parent grouping (forming the grouping {C_1, C_2, C_3}). The best local gain is
achieved by the grouping {C_1, C_3}, {C_2}; when we compare the mpGains of the other nodes at level 3, only
{C_1}, {C_2}, {C_3} comes within δ = 30.0 bits of this optimal local gain. Hence, only these two nodes remain
alive, and the other three are "killed". Since all three containers have now been assigned, we return the two
remaining live nodes at level three as the set of candidate partitioning strategies, Q.
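
The pruning decision at level three can be checked directly from the values reported in Fig. 4. The snippet below is a small illustration; the dictionary keys are plain-text labels chosen here for readability.

```python
# (localGain, mpGain) for each level-3 grouping, copied from Fig. 4.
level3 = {
    "{C1},{C2},{C3}": (100.9413, 100.9413),
    "{C1,C3},{C2}": (126.3966, 126.3966),
    "{C1},{C2,C3}": (50.4707, 50.4707),
    "{C1,C2},{C3}": (50.4707, 50.4707),
    "{C1,C2,C3}": (62.7933, 62.7933),
}
delta = 30.0

opt_gain = max(local for local, _ in level3.values())
survivors = [name for name, (_, mp) in level3.items() if mp >= opt_gain - delta]

print(opt_gain)    # best local gain at this level
print(survivors)   # groupings whose mpGain is at least opt_gain - delta
```

Only the two groupings named in the example have an mpGain of at least 126.3966 − 30.0 = 96.3966 bits.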

The pruning criterion in step 5 serves to reduce the size of the search space, yet it is crucial to ensure 
that it does not result in the removal of the node with the highest local compression gain (the optimal node). 
The following result proves that the optimum node will never be "killed".

Proposition 1. Alg. 2 ensures that the optimal node is always visited. 

Proof (by contradiction). Assume that the supposed optimal node o occurs at level n in the tree. Let a be
the highest ancestor of o that has been "killed" by the pruning criterion of step 5, let f be the (supposedly)
false optimal node, and let l < n be the level at which a and f occur. Fig. 5 provides an illustration of
this scenario. We first prove that for arbitrary a and o, mpGain(a) ≥ localGain(o). Assume instead that
localGain(o) > mpGain(a). Since the groupings represented by a and o agree on the placement of the first l

[Figure: search tree drawn by tree level; the legend distinguishes live nodes from killed nodes.]

Fig. 5. Illustrating the proof of Proposition 1.



containers, there must be at least one level k, for k ∈ [l + 1, n], at which the placement of container C_k under
the grouping associated with o yields a higher gain than the placement specified by the grouping associated
with a. Without loss of generality, we first assume that a is the direct ancestor (parent) of o, and hence a
and o agree on the placement of the first n − 1 containers, and only disagree on the choice of placement for
container C_n. Then



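\[
\mathit{localGain}(o) = \sum_{S_i \in o \,\wedge\, C_n \notin S_i} \mathit{localGain}(S_i)
  + \sum_{S_j \in o \,\wedge\, C_n \in S_j} \mathit{localGain}(S_j)
\]
\[
> \; \mathit{mpGain}(a) = \sum_{S_i \in a \,\wedge\, C_n \notin S_i} \mathit{localGain}(S_i)
  + \sum_{S'_j \in a \,\wedge\, C_n \in S'_j} \mathit{localGain}(S'_j)
\]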

where S_j is the subset containing container C_n under grouping o, and S'_j is the subset for C_n chosen
by Alg. 1 for the input grouping a. Simplifying, we obtain



\[ \mathit{localGain}(S_j) > \mathit{localGain}(S'_j) \]

which contradicts Alg. 1. Recall that at each step, Alg. 1 applies the longest possible phrase, guaranteeing 
that no other LZ76 parsing strategy could yield a lower complexity score C_LZ. This result can be generalized
to the case where a is an indirect ancestor of o, since mpGain increases monotonically as one travels upward 
in a subtree. 
Returning to the original claim, we note that if o is optimal, then localGain(o) must be greater than the value of
optGain_{n−1} (i.e., the globally optimal localGain witnessed after n − 1 levels of the search tree have been
processed). Additionally, since the ancestor node a has been "killed" during Step 5 of the branch-and-bound
algorithm, it must also be true that mpGain(a) < optGain_l − δ (here, optGain_l denotes the global optimum
seen after l levels of the search tree have been visited). Since optGain_{n−1} ≥ optGain_l, the only means of
satisfying both conditions is for the following chain of inequalities to be satisfied:

\[ \mathit{localGain}(o) > \mathit{optGain}_{n-1} \geq \mathit{optGain}_l > \mathit{mpGain}(a) + \delta . \]
This contradicts the previous result indicating that mpGain(a) ≥ localGain(o), completing the proof.

□ 



4.5 Determining an Optimal Compression Configuration 

Alg. 3 allows one to determine an optimal compression configuration from an input set Q of candidate
partitioning strategies (obtained from Alg. 2) and set A of compression algorithms, together with upper
bounds on compression and decompression time, T_c and T_d. The variable globalBestGain records the highest
overall compression gain from the partitioning strategy/algorithm assignment combinations tested so far.
Lines 2-29 iterate through each candidate partitioning strategy G ∈ Q; each container subset S contained in
G is tested (Lines 4-23) to determine the compression algorithm a ∈ A that achieves the highest compression
gain (Lines 6-14). Before an algorithm is assigned to a container subset, a test is performed to ensure that
the required compression and decompression time values fall below the respective bounds T_c and T_d (Line
8).

At the conclusion of testing, if there is no available algorithm in A that satisfies the time bounds for
compression and decompression when applied to a specific subset S, the entire partitioning strategy
containing S is immediately disqualified (Lines 15-16). Otherwise, the overall compression gain and
compression/decompression time scores are updated for the partitioning strategy G, and the appropriate compression
algorithm is assigned to the active subset S (Lines 17-22). After each partitioning strategy G has been
processed, a test is done to determine whether it yields a better gain than the current globalBestGain; if
necessary, the globally-best compression configuration (P, α) is updated to store the current partitioning
strategy G, along with the optimal algorithm selection strategy α_G found for G (Lines 24-28).

After all partitioning strategies in Q have been processed, the optimal compression configuration (P, α) is
returned (Line 30).

4.6 Discussion & Practical Considerations

Comparison with greedy algorithm. At first glance, it might seem tempting to employ a simple greedy 
strategy for selecting a partitioning strategy. In terms of the branch-and-bound search tree, this corresponds 
to exploring only the root-to-leaf branch formed by selecting at each level the child node with the highest 
local gain value. While such a strategy is more time efficient, it is not too difficult to envision scenarios 
where it results in a sub-optimal partitioning strategy being chosen. As an example, we can consider an 
XML document consisting of DBLP-like bibliographic entries. At an earlier stage, both the branch-and- 
bound and greedy strategies may decide to place "author" and "year" containers in different subsets due 
to their dissimilarities. At a later stage, suppose that the container storing "key" values for entries is to 
be assigned, and further, assume that "key" values are formed by concatenating the author name with the 
year. This quite possibly causes a higher compression gain to be obtainable by grouping "author", "year", 
and "key" containers together. While the branch-and-bound strategy is capable of reconsidering the initial 
decision to separate "author" and "year" containers after observing the characteristics of future containers, 
the greedy strategy is not. 
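
The effect can be reproduced on a toy instance. The sketch below is illustrative only: the gain function is a hypothetical stand-in that rewards subsets with a small alphabet (it is not the LZ76-based measure), the three strings loosely play the roles of the "author", "year", and "key" containers, and the containers are processed in the given order rather than sorted by size.

```python
def toy_subset_gain(subset):
    # Hypothetical measure: characters saved minus a per-symbol dictionary cost.
    text = "".join(subset)
    return max(0, len(text) - 3 * len(set(text)))

def toy_local_gain(grouping):
    return sum(toy_subset_gain(s) for s in grouping)

def greedy_partition(containers, local_gain):
    """Follow one root-to-leaf branch, always taking the child with the best local gain."""
    grouping = [[containers[0]]]
    for c in containers[1:]:
        options = [grouping + [[c]]]                       # new subset for c
        options += [grouping[:k] + [grouping[k] + [c]] + grouping[k + 1:]
                    for k in range(len(grouping))]         # or join an existing subset
        grouping = max(options, key=local_gain)
    return grouping

containers = ["aaaa", "bb", "abababab"]
greedy = greedy_partition(containers, toy_local_gain)
together = [containers]                                    # all three in one subset

print(greedy, toy_local_gain(greedy))
print(together, toy_local_gain(together))
```

Here the greedy choice separates the first two containers because that is locally better, and it can no longer reach the single-subset grouping, which scores higher once the third container is taken into account.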
Input: set of compression algorithms A, set of candidate container partitions Q, upper bound T_c ∈ Z^+ on
       compression time, upper bound T_d ∈ Z^+ on decompression time
Output: a compression configuration (P, α)

 1  globalBestGain ← 0; P ← NULL; α ← NULL;
 2  foreach G ∈ Q do
 3      groupingCTime ← 0; groupingDTime ← 0; groupingGain ← 0;
 4      foreach S ∈ G do
 5          maxGain ← 0; bestCTime ← 0; bestDTime ← 0; bestAlgorithm ← NULL;
 6          foreach a ∈ A do
 7              gain ← compressedGain(S, a);
 8              if gain > maxGain and groupingCTime + compressTime(S, a) < T_c and
                   groupingDTime + decompressTime(S, a) < T_d then
 9                  bestCTime ← compressTime(S, a);
10                  bestDTime ← decompressTime(S, a);
11                  bestAlgorithm ← a;
12                  maxGain ← gain;
13              end
14          end
15          if bestAlgorithm = NULL then
16              goto line 2;   (disqualify G and continue with the next candidate)
17          else
18              groupingCTime ← groupingCTime + bestCTime;
19              groupingDTime ← groupingDTime + bestDTime;
20              groupingGain ← groupingGain + maxGain;
21              α_G(S) ← bestAlgorithm;
22          end
23      end
24      if groupingGain > globalBestGain then
25          globalBestGain ← groupingGain;
26          P ← G;
27          α ← α_G;
28      end
29  end
30  return (P, α);

Algorithm 3: Selecting a compression configuration. 
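
A runnable sketch of the selection loop is given below. It follows the prose reading in which a grouping is dropped as soon as one of its subsets has no algorithm within the time bounds. The function and parameter names are hypothetical, the gain and timing callables are supplied by the caller, and the toy values in the usage lines are invented purely for illustration.

```python
def select_configuration(candidates, algorithms, gain, c_time, d_time, t_c, t_d):
    """Return (P, alpha, best_gain); alpha maps a subset's index in P to its algorithm."""
    best_gain, best_partition, best_assignment = 0, None, None
    for grouping in candidates:
        total_c = total_d = total_gain = 0
        assignment = {}
        for idx, subset in enumerate(grouping):
            max_gain, best = 0, None
            for alg in algorithms:
                g = gain(subset, alg)
                ct, dt = c_time(subset, alg), d_time(subset, alg)
                if g > max_gain and total_c + ct < t_c and total_d + dt < t_d:
                    max_gain, best, best_ct, best_dt = g, alg, ct, dt
            if best is None:
                break                       # no feasible algorithm: disqualify grouping
            total_c += best_ct
            total_d += best_dt
            total_gain += max_gain
            assignment[idx] = best
        else:                               # runs only if no subset was disqualified
            if total_gain > best_gain:
                best_gain, best_partition, best_assignment = total_gain, grouping, assignment
    return best_partition, best_assignment, best_gain

# Toy inputs (assumptions): "slow" doubles the gain but costs 10 time units.
ratio = {"fast": 1, "slow": 2}
cost = {"fast": 1, "slow": 10}
toy_gain = lambda subset, alg: len("".join(subset)) * ratio[alg]
toy_time = lambda subset, alg: cost[alg]

candidates = [[["aa"], ["bbb"]], [["aa", "bbb"]]]
print(select_configuration(candidates, ["fast", "slow"], toy_gain, toy_time, toy_time, 15, 15))
print(select_configuration(candidates, ["fast", "slow"], toy_gain, toy_time, toy_time, 1, 1))
```

Keying the assignment by subset index avoids having to hash the subsets themselves; with time bounds too tight for any algorithm, every candidate is disqualified and no configuration is returned.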



Choosing a good δ value. Choosing a value for δ represents a tradeoff between accuracy and running time.
Smaller δ values will cause more nodes to be "killed" at each level in the search tree, reducing the size of
the tree which must be explored. On the other hand, this also increases the likelihood that the true optimal
configuration will not be chosen: a partitioning strategy may be discarded in phase one of the algorithm
based solely on having a lower calculated local compression gain value, even though it may outperform all
of the chosen candidate partitioning strategies in the second phase when a particular algorithm selection
strategy is applied to it. Conversely, choosing a large enough δ value ensures that the optimal configuration
is always chosen, at the potential expense of an exhaustive enumeration of all possible combinations of
container groupings and algorithm selections, requiring time exponential in |C|.
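
To give a sense of scale for the exhaustive case: without pruning, the leaves of the search tree are the ways of partitioning the |C| containers into non-empty subsets, a count known as the Bell number (five leaves for three containers, as in Fig. 4). The helper below, whose name is chosen here for illustration, computes it with the Bell triangle.

```python
def bell(n):
    """Number of ways to partition n labelled items into non-empty subsets."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]               # each row starts with the previous row's last entry
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

for n in (3, 5, 10, 20):
    print(n, bell(n))
```

The count grows faster than any fixed exponential base, which is why a finite δ is needed to keep the first phase tractable for documents with many containers.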

Ordering of containers. Proposition 1 illustrates that the optimal container partitioning strategy can never be
"killed", implying that the ordering of containers does not impact the discovery of the optimal grouping. Yet
container ordering can affect the efficiency of the algorithm's first phase. In particular, sorting the containers
in descending order of their sizes can dramatically reduce the number of tree nodes that are explored. Such
an ordering causes a larger number of characters to have fixed container assignments at an earlier level in the
tree, thereby reducing the number of unprocessed characters and allowing tighter bounds on mpGain to be
established more quickly. As a result, larger subtrees can be pruned from the search tree.
5 Conclusion



In this paper, we demonstrated that determining an optimal configuration for permutation-based XML
compression is an NP-hard problem. We also described an approximation algorithm that allows one, with
proper selection of parameter values, to discover the optimal compression configuration in polynomial time
(w.r.t. the sizes of the document and the set of compression algorithms A). As future work, we plan to
implement this algorithm within our existing XML-conscious compressor [8] and test its effectiveness via
experimentation over a range of real-world and synthetic XML documents.